PREFACE
aircrew selection and classification. The authors thank Drs. Malcolm James
Ree,  Joseph  L.  Weeks,  and  William  E.  Alley  for  their  many  helpful  comments 
and  suggestions. 


Treatment  of  Outliers  in  Cognitive 
and  Psychomotor  Test  Data 


SUMMARY 

Many statistical texts and review articles have pointed out the possible
adverse  effects  that  outliers  can  have  on  model  parameter  estimates,  and  have 
suggested  several  methods  for  detecting  and  treating  outliers.  In  the  present 
study,  the  effects  of  two  different  methods  for  treating  outliers  in  aptitude  tests 
(data  deletion  and  data  transformation)  were  investigated  at  the  item  and 
total-score  level  on  the  internal  consistency  and  criterion-related  validity  of  six 
computerized  tests  being  evaluated  by  the  U.S.  Air  Force.  Over  2,000  pilot 
training  candidates  were  tested.  Results  indicated  that  neither  outlier  treatment 
method  at  either  level  of  analysis  had  significant  effects  on  tests’  psychometric 
characteristics. Possible reasons for these findings include the rarity with which
outliers  actually  occur,  and  the  robustness  of  linear  modeling  methods. 


INTRODUCTION 

Data  points  which  lie  apart  from  the  majority  of  the  data,  or  outliers,  can 
have  a  significant  impact  on  model  parameter  estimates  (Chatterjee  &  Hadi, 
1986;  Cook  &  Weisberg,  1982;  Maier,  1988;  Neter,  Wasserman,  &  Kutner,  1990; 
Stevens,  1984).  There  may  be  many  sources  of  outlying  data  points  including 
(a) collecting data from subjects (e.g., pre-teens) who are not members of the
targeted population (adults), (b) extreme contributions of random error
components,  (c)  data  recording  errors,  and  (d)  errors  in  data  preparation  (Orr, 
Sackett,  &  Dubois,  1991). 

Although  several  methods  for  detecting  and  treating  outliers  have  been 
developed (see Belsley, Kuh, & Welsch, 1980, and Chatterjee & Hadi, 1986, for
reviews),  issues  of  outlier  detection  and  treatment  have  received  scant  attention 
in  the  human  resource  management  literature.  As  one  example,  Orr  et  al. 
(1991)  reviewed  100  selection  validation  studies  cited  by  Schmitt,  Gooding,  Noe, 
and Kirsch (1984) which had been published between 1964 and 1982 in Personnel
Psychology and the Journal of Applied Psychology. According to Orr et al.
(1991), “not a single study mentioned looking for, finding, or removing outlying
data” (p. 475). This is surprising, given the influence that a small number of
data points, or even one, could have on model parameter estimates (Cook &
Weisberg, 1982).

To  examine  possible  effects  of  outliers  on  test  validities,  Orr  et  al.  (1991) 
investigated  the  effects  of  outlier  removal  on  the  validity  of  selected  General 
Aptitude  Test  Battery  (GATB;  see  Hunter,  1980  for  a  description  of  the  GATB) 
tests in a set of 183 studies. They found that (a) “outlying data points were
not ... a substantial source of variance” (p. 473), (b) removing outliers had little
effect on mean validities, and (c) removing outlying data points often reduced,
rather than increased, test validities. There are at least three explanations for
these somewhat unexpected findings. First, deleted observations may have
been extreme but not outlying. Thus, the deleted observations may not have
been outliers but well-behaved data points instead. Second, deleting observations
lying in the tails of the score distribution may have artifactually restricted the
score variance and consequently reduced validity. Finally, the
type of paper-and-pencil scores studied by Orr et al. (1991) may not be as
sensitive to the threat of outliers as other types of scores (e.g., response
latencies, tracking error). Paper-and-pencil tests typically are scored in terms
of percentages (e.g., percent correct, percent completed), and may be converted
to some other metric (e.g., z-scores, percentiles) for interpretive ease, so that
out-of-range values are easily recognized and corrected, and scores are bounded
by admissible values (e.g., 0% to 100% correct, 1st through the 99th percentile).
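
The second explanation (range restriction) can be illustrated with a small sketch. The data below are hypothetical and deterministic, not GATB scores; the helper name pearson_r and the cutoffs are assumptions chosen only for illustration. Deleting well-behaved observations in the tails of the predictor lowers the observed correlation even though the underlying relationship is unchanged.

```python
from math import sqrt

def pearson_r(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: criterion = predictor plus noise of fixed size (+20 / -20).
predictor = list(range(-50, 51))
criterion = [x + (20 if i % 2 == 0 else -20) for i, x in enumerate(predictor)]

r_full = pearson_r(predictor, criterion)

# Delete the 10 lowest and 10 highest predictor scores. These points are
# extreme but well behaved: they follow the same linear rule as the rest.
kept = [(x, y) for x, y in zip(predictor, criterion) if -40 <= x <= 40]
r_trimmed = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"full range r = {r_full:.3f}, tails deleted r = {r_trimmed:.3f}")
```

In this constructed case the trimmed correlation is lower than the full-range one solely because the predictor variance shrank while the noise variance did not.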

Outliers may be more influential in other types of measures such as response
latencies and/or measures of psychomotor performance. Response latencies
characteristically are positively skewed and often contain outlying “long” responses
(Luce, 1986; Teichner & Krebs, 1972, 1974). This becomes an issue because
along  with  simple  adaptation  of  paper-and-pencil  tests  to  computer  administration, 
computerized  test  scoring  allows  the  measurement  of  additional  dimensions  of 
performance  beyond  simple  accuracy  scores  (correct/incorrect)  such  as  response 
latency  and  tracking  error.  Thus,  while  outliers  may  not  be  pervasive  and 
influential  in  traditional  paper-and-pencil  tests,  they  might  be  more  so  on 
computerized  tests,  and  particularly  on  performance  measures  which 
characteristically  are  nonnormal  (Green,  1988). 

There  are  two  further  issues  with  respect  to  outliers  in  response  latency 
and  tracking  error  data  which  ordinarily  are  not  issues  for  tests  containing  items 
scored  correct/incorrect:  (a)  level  of  analysis,  and  (b)  whether  data  should  be 
deleted  or  transformed  to  treat  outliers.  First,  outliers  simply  do  not  exist  at 
the  item  level  for  tests  scored  correct/incorrect.  An  individual’s  response  is 
either  right,  or  it  is  wrong.  For  these  tests,  outliers  exist  only  at  the  total 
score level. Outlying respondents (a) answer a large proportion of items
correctly or incorrectly or (b) skip or do not attempt many items. On the other
hand, outlying response latencies can be extremely short (e.g., an anticipatory
response), or extremely long on individual test items. Similarly, tracking errors
can  be  very  small  or  very  large  for  particular  scoring  intervals  (i.e.,  time 
segments)  as  well  as  the  test  overall.  Thus,  the  issue  is  whether  outlying 
responses  should  be  treated  at  the  item  level,  or  only  at  the  total  score  level, 
given  that  a  sufficient  number  of  item-level  outliers  cumulate  to  produce  an 
outlying  total  test  score. 

The  second  issue  is  whether  outlying  data,  either  at  the  item-  or  total-score 
level  should  be  deleted,  or  whether  the  data  should  be  transformed  so  that 
possible  ill  effects  of  outlying  data  points  on  model  parameter  estimates  are 
reduced.  Much  of  the  literature  on  outliers  and  influential  data  points  has 
focused  on  how  to  identify  data  points  to  delete  (e.g.,  Chatterjee  &  Hadi,  1986; 
Cook  &  Weisberg,  1982),  with  little  or  no  attention  paid  to  the  effects  of  data 
transformations.  On  the  other  hand,  researchers  accustomed  to  working  with 
reaction time data routinely apply logarithmic transformations to more nearly
normalize  the  data,  and  rarely  delete  data  points  (Luce,  1986). 


Purpose 

Computerized  testing  permits  the  measurement  of  additional  dimensions  of 
performance  beyond  those  measured  by  traditional  paper-and-pencil  tests  such 
as  response  latency  and  psychomotor  performance.  However,  the  availability 
of  these  measures  raises  issues  of  how  the  data  should  be  treated 
psychometrically. This study focused on the treatment of outlying data on a
representative  computerized  test  battery  (Basic  Attributes  Test  or  BAT;  see 
Carretta,  1990). 


Measures 

The  BAT  battery  used  in  this  study  consisted  of  six  computerized  tests  that 
assessed individual differences in psychomotor coordination (rotary pursuit,
compensatory  tracking),  information  processing  ability  (spatial  transformation, 
short-term  memory,  and  time  sharing  ability),  and  attitudes  toward  risk.  The 
types  of  scores  generated  from  these  tests  include  tracking  error,  response 
time,  response  accuracy,  and  response  choice.  Several  studies  have  shown 
that  BAT  scores  are  useful  for  predicting  US  Air  Force  pilot  training  performance 
and  provide  incremental  validity  when  used  with  operational  selection  instruments 
such as the AFOQT (Bordelon & Kantor, 1986; Carretta, 1989; Kantor & Carretta,
1988). Operational implementation of the BAT as an adjunct to current pilot
candidate selection methods is expected in 1993 (Carretta, 1992). A brief
description of the BAT selection tests follows; a more detailed description was
provided  by  Carretta  (1989,  1990). 

Two-Hand Coordination. This pursuit tracking task was used to measure
multilimb coordination (Fleishman, 1964). An airplane (target) moved in a fixed,
elliptical pattern at a varying rate. The subject controlled the horizontal and
vertical movement of a “gunsight” using the right (horizontal) and left (vertical)
control sticks. The subject’s task was to keep the gunsight on the target.
Horizontal and vertical tracking error was scored for each of ten 30-second
intervals (n = 2,451).

Complex Coordination. This compensatory tracking task was an example
of control precision and multilimb coordination (Fleishman, 1964). The dual-axis
right control stick was used to control the horizontal and vertical movement of
a cursor. The left control stick was used to control the left-right movement of
a “rudder bar” at the base of the screen. The subject’s task was to maintain
the cursor (against a constant horizontal and vertical rate bias) centered on a
large cross fixed at the center of the screen, while simultaneously centering
the rudder bar at the base of the screen (also against a constant rate bias).
Horizontal and vertical tracking error was scored for each of ten 30-second
intervals (n = 2,451).

Mental Rotation. This was a variation of a spatial transformation task
(Shepard & Metzler, 1971). The subject was presented sequentially with two
letters and was required to make a same-different judgment. The letter pair
consisted of either same or mirror images and the letters were either in the
same  orientation  or  rotated  in  relation  to  each  other.  A  correct  “different” 
judgment  is  associated  with  a  mirror  image  pair  and  is  not  dependent  on  the 
relative  rotation  of  the  two  letters.  Response  speed  and  accuracy  were  scored 
for  each  of  the  72  items  for  this  test  (n  =  2,147). 

Item  Recognition.  This  measure  of  short-term  memory  was  based  on  a 
task  proposed  by  Sternberg  (1966).  A  string  of  1  to  6  digits  was  presented 
on  the  screen.  The  string  was  then  removed,  and  after  a  brief  delay,  replaced 
by  a  single  digit.  The  subject  was  instructed  to  remember  the  digit  string  and 
determine  whether  the  single  digit  was  one  of  those  presented  in  the  digit 
string.  Response  speed  and  accuracy  were  scored  for  each  of  the  48  items 
for  this  test  (n  =  2,209). 

Time  Sharing.  This  test  provided  a  measure  of  time  sharing  performance 
(North  &  Gopher,  1976).  In  the  first  10  minutes  of  this  test,  the  subject  was 
required  to  keep  a  randomly  moving  “gunsight”  on  an  airplane  (target)  using 
the  right-hand  control  stick.  In  the  next  six  minutes,  the  subject  had  to  repeat 
the  tracking  task  and  simultaneously  cancel  digits  which  appeared  at  random 
intervals  and  locations  on  the  screen.  Digit  cancellation  was  timed  and  consisted 
of  pressing  the  same  digit  on  the  numeric  keypad.  The  final  three  minutes  of 
the  test  consisted  of  only  the  tracking  task.  Tracking  difficulty  was  varied  by 
increasing  or  decreasing  the  control  stick  sensitivity  as  a  function  of  the  tracking 
error.  Scores  for  this  test  included  tracking  difficulty  and  response  time  (n  = 
2,356). 

Activities  Interest  Inventory.  This  test  was  designed  to  measure  the  subject’s 
attitudes  toward  risk-taking  (Mullins,  1962).  The  subject  was  presented  with 
81  pairs  of  activities  and  was  asked  to  choose  between  them.  The  activity 
pairs  forced  the  subject  to  choose  between  activities  that  differed  as  to  degree 
of  threat  (sometimes  subtly,  sometimes  not).  Response  speed  and  response 
choice were scored for each item (n = 2,355).


Apparatus 

The test apparatus consisted of a microcomputer and monitor built into a
ruggedized chassis with a glare shield and side panels designed to minimize
distractions. The subjects responded to the tests by manipulating, individually
or in combination, a dual-axis control stick on the right side, a single-axis
control  stick  on  the  left  side,  and  a  keypad  in  the  center  of  the  test  unit.  The 
keypad  included  keys  labeled  0  to  9,  an  ENABLE  key  in  the  center,  and  a 
bottom  row  with  YES  and  NO  keys,  and  two  others  for  same/different  responses 
(S/D),  and  left/right  responses  (L/R). 


Procedure


All  subjects  were  enrolled  in  a  4-year  college  program.  They  were  tested 
on the AFOQT either prior to entering college or while undergraduates and
were  tested  on  the  BAT  battery  at  the  beginning  of  a  flight  screening  program 
prior  to  receiving  flying  training. 

The  tests  used  in  this  study  were  part  of  a  longer  test  battery  that  required 
about  four  hours  to  complete  including  programmed  breaks  between  the  tests. 
After  the  test  administrator  briefed  the  subjects  and  initialized  the  test  battery, 
the  test  session  was  self-paced  by  each  subject.  Programmed  breaks  of  one 
or  two  minutes  between  tests  were  included  in  order  to  reduce  mental  and 
physical  fatigue. 


Training  Performance 

Undergraduate Pilot Training (UPT) is a 53-week program involving a T-37
phase  (initial  jet  trainer,  21  weeks)  and  a  T-38  phase  (advanced  jet  trainer,  32 
weeks).  UPT  final  outcome  was  awarded  at  the  end  of  the  program  and  was 
scored  as  a  dichotomous  variable.  Graduates  received  a  score  of  1  and 
eliminees  received  a  score  of  0. 


Identification  and  Deletion  of  Outliers 

Although  there  are  numerous  methods  for  detecting  outlying  data  points 
(e.g.,  Chatterjee  &  Hadi,  1986),  graphical  methods  still  are  among  the  most 
effective.  We  used  univariate  frequency  distributions  to  identify  outliers  both 
at  the  item-  and  total-score  level.  Specifically,  they  were  examined  for  points 
at  which  there  were  discontinuities  in  the  frequency  distributions,  beyond  which 
very  few  data  points  lay.  In  most  cases,  this  corresponded  closely  to  the  1st 
and/or  99th  percentiles  of  the  distribution.  Outlying  data  points  were  defined 
to  lie  beyond  these  points.  Sets  of  total  scores  were  created  both  with  outlying 
item  scores  included  and  deleted.  Also,  total  scores  that  included  outlying  item 
scores  were  themselves  examined  for  outlying  values  and  were  censored  for 
outliers  at  the  total  score  level. 
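
The cutoff rule described above can be sketched as follows. This is a simplification under stated assumptions: the study located cutoffs by inspecting frequency distributions for discontinuities, whereas the sketch simply applies fixed 1st and 99th percentile bounds. The function names (percentile, censor) and the latency values are hypothetical.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in 0..100) of pre-sorted values."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def censor(values, lower_pct=1.0, upper_pct=99.0):
    """Split values into (kept, flagged) using percentile cutoffs as bounds."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged

# Hypothetical response latencies (ms): 199 ordinary values and one very
# long response of the kind a frequency distribution would show as isolated.
latencies = [400 + 2 * i for i in range(199)] + [9000]
kept, flagged = censor(latencies)
print(flagged)
```

A fixed percentile rule flags the isolated 9000 ms response, but it also flags ordinary values at both ends of the distribution, which is one reason graphical inspection for discontinuities is preferable to a mechanical cutoff.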


Data  Transformations 

Tracking  error  and  response  latency  data  were  markedly  positively  skewed. 
For  this  type  of  distribution,  Mosteller  and  Tukey  (1977)  recommend  either  a 
square-root or a natural log transformation to more closely normalize the data.
We applied both transformations at both the item- and total-score level. Thus,
two  sets  of  total  scores  were  created.  The  first  was  based  on  nontransformed, 
square  root  transformed,  or  log  transformed  item-level  data,  that  were  summed 
to  form  a  total  score  (item-level).  The  second  was  based  on  nontransformed 
item-level data that were summed to form a total score, which then remained
nontransformed or was square-root or log transformed (total-score level).
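
The two levels of transformation are not interchangeable, because the sum of transformed item scores generally differs from the transformed sum. A minimal sketch, assuming hypothetical latency values and helper names (total_item_level, total_score_level) that do not come from the study:

```python
from math import log, sqrt

def total_item_level(items, transform=None):
    """Transform each item score, then sum (item-level treatment)."""
    f = transform or (lambda v: v)
    return sum(f(v) for v in items)

def total_score_level(items, transform=None):
    """Sum the raw item scores, then transform the total (total-score treatment)."""
    f = transform or (lambda v: v)
    return f(sum(items))

# Hypothetical response latencies (ms) for one subject, one of them very long.
latencies = [100.0, 400.0, 900.0, 10000.0]

item_sqrt = total_item_level(latencies, sqrt)    # 10 + 20 + 30 + 100
total_sqrt = total_score_level(latencies, sqrt)  # sqrt(11400)
item_log = total_item_level(latencies, log)      # sum of natural logs
total_log = total_score_level(latencies, log)    # natural log of the sum

print(item_sqrt, round(total_sqrt, 2), round(item_log, 2), round(total_log, 2))
```

With item-level transformation the long response is compressed before it enters the total; with total-score transformation it contributes its full raw value to the sum and only the total is compressed.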

Analysis 

Internal  consistency  reliability  (Cronbach’s  alpha  [Cronbach,  1951])  was 
estimated  for  each  BAT  score  for  both  nontransformed  and  transformed  data. 
Cronbach’s  alpha  is  the  most  widely  cited  measure  of  internal  consistency.  It 
is  the  average  of  all  split-half  reliability  coefficients,  a  measure  of  test  homogeneity, 
and  an  estimate  of  first  factor  saturation  (Stanley,  1971).  Test  score  validities 
were  estimated  by  correlating  nontransformed  and  transformed  BAT  summary 
scores  with  UPT  final  outcome  (graduation  or  elimination). 
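
Both statistics can be sketched in a few lines. The data are hypothetical (four subjects, three items), and the function names are illustrative assumptions. Coefficient alpha is computed as k/(k - 1) times one minus the ratio of summed item variances to total-score variance; validity against a 0/1 outcome is an ordinary Pearson correlation, which in that case equals the point-biserial correlation.

```python
from math import sqrt

def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(rows):
    """rows: one list of item scores per subject, all of equal length."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical item scores and training outcomes (1 = graduate, 0 = eliminee).
item_scores = [[2, 3, 3], [4, 4, 5], [5, 6, 5], [7, 8, 9]]
outcome = [0, 0, 1, 1]

alpha = cronbach_alpha(item_scores)
validity = pearson_r([sum(row) for row in item_scores], outcome)
print(round(alpha, 3), round(validity, 3))
```

To compare treatments, the same two functions would be applied to each version of the scores (raw, outliers deleted, square-root, log) and the resulting coefficients set side by side.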


RESULTS 

Between 0.27% (Item Recognition response time) and 15.65% (Mental
Rotation  response  time)  of  the  subjects  were  identified  as  having  outlying  values 
at  the  item-level  for  nontransformed  response  latency  or  tracking  performance 
measures.  The  mean  percent  of  subjects  having  outlying  item-level  data  was 
3.98%  for  these  scores.  The  proportion  of  subjects  with  outlying  values  at  the 
total-score  level  for  nontransformed  data  ranged  from  0.13%  (Activities  Interest 
Inventory  average  response  time)  to  2.23%  (Complex  Coordination,  horizontal 
tracking  error)  with  a  mean  of  0.76%. 

The treatment of outliers at the item-level (inclusion or exclusion;
nontransformed or transformed) had little effect on the internal consistency
estimates. Internal consistency estimates (Cronbach’s alpha), shown in Table
1, were acceptably high for all scores. The mean internal consistency ranged
from .929 to .947 across the outlier treatment methods.

Correlations between UPT final outcome and test scores based on item-level
and total-score level data are summarized in Tables 2 and 3.

Neither deleting outliers nor transforming data at the item- or total-score
level had much impact on the test score correlations with UPT final outcome.
In  the  case  of  the  test  scores  based  on  item-level  data  (Table  2),  outlier  deletion 
and  data  transformation  actually  lowered  many  validity  coefficients. 


DISCUSSION 

In  particular,  it  is  noteworthy  that  the  inclusion  or  removal  of  outliers  had 
little  influence  on  the  internal  consistency  and  predictive  validity  of  the  BAT 
scores.  This  is  important  as  these  tests  are  expected  to  become  operational 
US  Air  Force  pilot  candidate  selection  instruments  in  the  near  future  (Carretta, 
1992).  Under  operational  conditions,  it  will  be  desirable  for  all  pilot  training 
applicants  to  receive  meaningful  test  scores  so  that  personnel  selection  decisions 
can  be  made. 

Table 1. Coefficient Alpha for Nontransformed and Transformed Scores
MEAN _ £41 _ .929 .947 .932 .939 .937

Note. The column labeled “N Items” refers to time intervals for the tests involving tracking performance (Two-Hand Coordination, Complex
Coordination, and Time Sharing).

Contrary to textbook examples of how extreme data points (outliers) can
be unduly influential in model parameter estimates, this is the second study
(in addition to Orr et al., 1991) which has shown that in the area of personnel
testing, they generally are not. There may be several reasons why.

First,  outliers  may  occur  with  less  frequency  than  might  be  expected.  For 
example,  Orr  et  al.  (1991)  observed  that  many  samples  of  GATB  data  failed 
to  contain  any  cases  that  qualified  as  an  outlier  according  to  statistical  criteria. 
Thus, the presence of outliers may simply be less of a problem than some
have thought. Second, extreme data points may not be outlying as often as
they are diagnosed. As an example from the present context, individuals who
have extremely long response times to experimental tasks may not make suitable
pilots if they also respond very slowly to information received in the cockpit
(or worse, they may not be pilots for long!). Third, correlational methods may
be robust over attempts to treat outliers. For example, monotonic data
transformations such as the square-root or logarithmic affect the shape of the
data distribution but do not alter the rank-ordering of the observations.
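
The rank-preservation property is easy to check directly. The sketch below uses hypothetical latencies and an illustrative helper (ranks) that assumes no tied values:

```python
from math import log, sqrt

def ranks(values):
    """Rank of each observation (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

# Hypothetical response latencies (ms), including one very long response.
raw = [510.0, 350.0, 5200.0, 420.0, 640.0]
sqrt_scores = [sqrt(v) for v in raw]
log_scores = [log(v) for v in raw]

raw_ranks = ranks(raw)
print(raw_ranks, ranks(sqrt_scores), ranks(log_scores))
```

All three rank vectors are identical. The transformations change the spacing between observations, so a Pearson correlation can shift somewhat, but any statistic that depends only on rank order is unaffected.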

Results from this study should not be interpreted as indicating that outliers do
not threaten the integrity of research results, basic or applied. Indeed, both
the  Orr  et  al.  (1991)  study  and  the  present  one  were  conducted  in  the  Federal 
government  under  research  programs  where  great  care  was  taken  in  the 
collection  and  preparation  of  the  data  bases.  Thus,  problems  of  “out  of  range" 
data  were  minimized  in  both  cases.  Results  do  seem  to  suggest,  however, 
that within carefully constructed data sets, threats of the harmful effects of
outliers  may  not  be  as  serious  as  some  have  imagined. 